Did you know that a house's living area and selling price can help predict when it was built?
‘Dwellings’ is a dataset containing information about housing in Denver. Using this dataset, can we train a machine learning algorithm to correctly identify houses built before 1980, and which factors matter most for that prediction? Let’s build a classification model to find out.
Here is a short table showing what the data looks like:
Show the code
import pandas as pd
import numpy as np
import altair as alt
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from tabulate import tabulate
from sklearn.tree import DecisionTreeClassifier  # Decision tree classifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split  # Train/test split helper
from sklearn import metrics  # Accuracy and other scoring metrics

# Read the Dwellings csv into a pandas DataFrame
housing = pd.read_csv('https://raw.githubusercontent.com/byuidatascience/data4dwellings/master/data-raw/dwellings_ml/dwellings_ml.csv')
Show the code
housing.head()
|   | parcel | abstrprd | livearea | finbsmnt | basement | yrbuilt | totunits | stories | nocars | numbdrm | ... | arcstyle_THREE-STORY | arcstyle_TRI-LEVEL | arcstyle_TRI-LEVEL WITH BASEMENT | arcstyle_TWO AND HALF-STORY | arcstyle_TWO-STORY | qualified_Q | qualified_U | status_I | status_V | before1980 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 00102-08-065-065 | 1130 | 1346 | 0 | 0 | 2004 | 1 | 2 | 2 | 2 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
| 1 | 00102-08-073-073 | 1130 | 1249 | 0 | 0 | 2005 | 1 | 1 | 1 | 2 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
| 2 | 00102-08-078-078 | 1130 | 1346 | 0 | 0 | 2005 | 1 | 2 | 1 | 2 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
| 3 | 00102-08-081-081 | 1130 | 1146 | 0 | 0 | 2005 | 1 | 1 | 0 | 2 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
| 4 | 00102-08-086-086 | 1130 | 1249 | 0 | 0 | 2005 | 1 | 1 | 1 | 2 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |

5 rows × 51 columns
What Factors are Most Important?
To train the algorithm, we need to identify variables in the data that it can use to sort each house. One approach is to find variables with a strong positive or negative correlation with a house being built before 1980. The heatmap below shows each variable's linear correlation with yrbuilt, ranging from -1 (shown in red) to +1 (shown in blue). We will pick the variables that are the most red or the most blue.
Show the code
import matplotlib as plt
import pandas as pd
from plotly.subplots import make_subplots

housing2 = housing.drop(columns=housing.columns[0])  # Drop the first (identifier) column
corr = housing2.corr()

# Create a heatmap of each variable's correlation with yrbuilt
data = [go.Heatmap(z=[corr['yrbuilt']], colorscale='RdBu', x=housing2.columns)]

# Create subplots and add the heatmap trace
fig = make_subplots(rows=1, cols=1)
fig.add_trace(data[0], row=1, col=1)

# Update layout and make the y-axis scrollable
fig.update_layout(title_text='Heatmap for yrbuilt', height=600)
fig.update_yaxes(fixedrange=False)

fig.show()
From the chart, the following variables have been selected:
Show the code
# Define the variables of interest selected from the heatmap
variables_of_interest = ['stories', 'numbaths', 'quality_C', 'gartype_Att', 'gartype_Det',
                         'arcstyle_ONE-STORY', 'livearea', 'sprice', 'status_V',
                         'abstrprd', 'finbsmnt', 'nocars', 'deduct']

# Calculate each variable's correlation with 'yrbuilt'
correlation_with_yrbuilt = housing2[variables_of_interest].corrwith(housing2['yrbuilt'])

# Create a DataFrame to store the correlations and display it
correlation_table = pd.DataFrame({'Variable': variables_of_interest,
                                  'Correlation with yrbuilt': correlation_with_yrbuilt})
print(correlation_table.to_string(index=False))
Let’s take a closer look at two of these variables, numbaths and status_V, which have positive correlations of .38 and .30 respectively, to see why they might help separate the two classes.
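As a rough sketch of why a variable like numbaths can separate the two classes, we can compare group means by before1980. The frame below uses small made-up values purely for illustration, not the real Dwellings data:

```python
import pandas as pd

# Tiny synthetic sample (hypothetical values, not the real Dwellings data)
sample = pd.DataFrame({
    'numbaths':   [1, 1, 2, 3, 2, 3],
    'before1980': [1, 1, 1, 0, 0, 0],
})

# Compare the average bathroom count for each class
means = sample.groupby('before1980')['numbaths'].mean()
print(means)
```

If newer houses tend to have more bathrooms, the two group means separate, which is exactly the kind of signal a classifier can exploit.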
With these variables selected, we can train a random forest classifier and check its accuracy.

Show the code
# Define features and target, then split the data (a 70/30 split is assumed here)
feature_cols = variables_of_interest
X = housing2[feature_cols]
y = housing2['before1980']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

clf2 = RandomForestClassifier()

# Train the random forest classifier
model = clf2.fit(X_train, y_train)

# Predict the response for the test dataset
y_pred = model.predict(X_test)
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))
Accuracy: 0.9105389482871482
The model achieved over 90% accuracy.
Feature Importances
Let’s take a look at which of the features we used were most important to the model.
Show the code
import matplotlib.pyplot as plt

# Create a pandas Series with feature names as the index
feature_importances_series = pd.Series(model.feature_importances_, index=feature_cols)

# Plot feature importances as a bar chart
feature_importances_series.plot(kind='bar')
plt.title('Feature Importances')
plt.xlabel('Feature')
plt.ylabel('Importance')
plt.show()
Interestingly, feature importance does not necessarily align with how strongly a variable correlates with the target. For example, status_V had a higher correlation with year built than livearea, yet it was the least important feature, while livearea was the most important.
Using feature importance, I was also able to go back and reselect additional factors that ranked highly.
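One way to make this mismatch concrete is to rank features by correlation and by importance side by side. The sketch below uses the .38 and .30 correlations quoted above for numbaths and status_V, but the livearea correlation and all three importance values are hypothetical, chosen only to illustrate how the rankings can disagree:

```python
import pandas as pd

# Hypothetical numbers illustrating the point: correlation rank != importance rank
compare = pd.DataFrame({
    'feature':    ['livearea', 'status_V', 'numbaths'],
    'abs_corr':   [0.25, 0.30, 0.38],   # correlation with yrbuilt (livearea value illustrative)
    'importance': [0.40, 0.01, 0.15],   # random forest feature_importances_ (illustrative)
})

# Rank 1 = strongest under each criterion
compare['corr_rank'] = compare['abs_corr'].rank(ascending=False)
compare['imp_rank'] = compare['importance'].rank(ascending=False)
print(compare)
```

In this toy example livearea ranks last by correlation but first by importance, mirroring the pattern seen in the real model.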
Model Justification
Below you can see the actual tree used.
Show the code
from sklearn.tree import export_graphviz
import graphviz

# Take the first tree from the random forest
estimator = model.estimators_[0]

# Export the decision tree to a DOT file
export_graphviz(estimator, out_file='tree.dot', feature_names=feature_cols,
                filled=True, rounded=True)

# Convert the DOT file to an image (PNG)
graphviz.render('dot', 'png', 'tree.dot')

'tree.dot.png'
Decision Tree
Compared to other models, the random forest model was more accurate by about .02%. It relied on livearea as a high-importance feature. We can theorize that livearea, the liveable area of a house, correlates with the year a house was built because homes built in the same era tend to have similar living areas. In any case, the random forest classifier found livearea to be the most important feature, and it worked well for the model’s performance.
Performance
Below you can see the model’s accuracy, precision, and F1 score.
Accuracy: 0.9105389482871482
Precision: 0.928521373510862
F1 Score: 0.9281961471103327
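To clarify what these three scores measure, here is a worked example on a tiny hypothetical set of labels and predictions, computing each score from its definition (the values are made up for illustration):

```python
# Hypothetical true labels and model predictions
y_true = [1, 1, 0, 1, 0, 1, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # true positives
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # false negatives

# Accuracy: fraction of all predictions that are correct
accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)

# Precision: of the houses predicted positive, how many really are
precision = tp / (tp + fp)

# F1: harmonic mean of precision and recall
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, f1)  # 0.75 0.75 0.75
```

These are the same quantities that `metrics.accuracy_score`, `metrics.precision_score`, and `metrics.f1_score` compute on the real test set.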
Together, these scores indicate that the model is effective at predicting whether a house was built before 1980.
Conclusion
Surprisingly, the feature importances of the model did not necessarily align with each feature's linear correlation with year built. The random forest classification model performed well compared to other models and seemed an effective fit for the task due to its high accuracy, precision, and F1 score, which were higher than those of other models on the same dataset.